SYSTEM AND METHOD FOR
LOCALIZATION OF VIRTUAL SOUND

FIELD OF THE INVENTION 

The field of the present invention relates generally to virtual
acoustics and binaural audio. More particularly, the field of 
the present invention relates to a virtual sound system and 
method for simulating spatially localized "virtual" sound 
sources from a limited number of actual speakers. 

BACKGROUND

Over the past twenty years, considerable progress has 
been made in the field of virtual acoustics and binaural 
audio. Researchers in the field have advanced the
understanding of psychoacoustics by developing sound systems
that can generate virtual sound sources — perceived sound 
sources that appear to the listener to originate in areas of 
space that are distinct from the actual physical location of 
the speakers. 

It is well understood in the field of virtual acoustics that 
a listener's localization of a sound source is largely a 
function of the difference of the sound wave fronts at each 
of the ears of the listener. Interaural time difference (ITD) 
refers to the delay in time, and interaural intensity difference 
(IID) refers to the attenuation in intensity, between "sound"
perceived at the left and right ear drums of the listener. The 
brain uses these differences in the timing and magnitude of 
sounds between the ears to localize and identify the position 
in space from which the sound originates. 

At frequencies below about 1.5 kHz (i.e., frequencies where
the wavelength is larger than the listener's head), a listener
determines the position in space from which a sound
originates based primarily on the difference in time at which
the sound reaches the left and right ears of the listener (i.e.,
the ITD). However, at frequencies higher than about 1.5 kHz,
the spatial cue provided by the ITD is generally not
sufficient for a listener to determine the location based
solely on the ITD.

Instead, at frequencies greater than approximately 500 Hz
and less than 10 kHz, a listener may depend primarily on
intensity differences in the sound received by the left and
right ears of the listener (i.e., the IID). Variations in intensity
levels between the left and right eardrums are interpreted by
the human auditory system as changes in the spatial position
of the perceived sound source relative to the listener. Thus,
a virtual sound system can create a virtual or "3-D" sound
effect by providing a listener with appropriate spatial cues
(ITD, IID) for the desired location of the virtual sound
image.
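By way of illustration only, the two cues named above can be imposed on a monaural sample sequence with a delay and a gain. The following is a minimal sketch and is not taken from the original disclosure; the function name `apply_itd_iid` and its parameters are hypothetical, and a practical system would derive the cue values from the desired source location.

```python
def apply_itd_iid(mono, itd_samples, iid_gain):
    """Return (near_ear, far_ear) sample lists of equal length.

    Assumption for illustration: the far-ear signal is the near-ear
    signal delayed by `itd_samples` samples (the ITD) and scaled by
    `iid_gain` (the IID, typically between 0 and 1).
    """
    if itd_samples < 0:
        raise ValueError("itd_samples must be non-negative")
    # Pad the near-ear signal so both outputs have the same length.
    near = list(mono) + [0.0] * itd_samples
    # Delay and attenuate the copy that reaches the far ear.
    far = [0.0] * itd_samples + [iid_gain * x for x in mono]
    return near, far
```

For instance, at a 48 kHz sample rate a delay of 24 samples corresponds to an ITD of 0.5 msec.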

However, in order to provide a realistic and accurate virtual
sound image, the sound system must also take into account
the shape of the listener's head and the pinnae (or outer
ears) of the listener. The pinna of each ear
imposes unique frequency-dependent amplitude and time
differences on an incoming signal for a given source
position. The term Head-Related Transfer Functions (HRTF) is
used to describe the frequency-dependent amplitude and
time-delay differences in perceived sound originating from
a particular sound source that result from the complex
shaping of the pinnae at the left and right ear drums of the
listener. Thus, an effective virtual sound system provides
ITD and IID spatial cues that have been modified to
compensate for the spectral alterations of the HRTF of the
listener.

Several technical barriers exist to providing realistic
virtual audio over conventional speakers. The sound heard at
each ear of the listener is a mixture of signals from all of the
speakers providing sound to the listener. This mixture of 
signals or "crosstalk" makes it very difficult to create a stable 
virtual sound image because of the enormous complexity 
involved in calculating how the different signals will mix at 
a listener's ear. For example, in a two-speaker system, sound 
signals from each of the two speakers will be heard by both 
ears and mix in an unpredictable manner to alter the spectral 
balance, ITD and IID differences in sound signals perceived 
by the listener. 

A theoretical solution for this dilemma, known as 
crosstalk cancellation, was originally proposed over 20 
years ago. Crosstalk cancellation presupposes that a sound 
system can add a binaural signal at each speaker that is the 
inverse (i.e., 180 degrees out of phase) of the crosstalk 
coming from a competing speaker, delayed by the difference
in the time it takes the competing speaker's sound to reach the
opposite ear, to cancel the sound of the undesired speaker at
a given ear. Thus, using crosstalk cancellation, a sound 
system can, in theory, assure that a listener's left ear hears 
the output of the left speaker and a listener's right ear hears 
the output of the right speaker. 

While systems have been implemented using crosstalk 
cancellation, several limitations have been encountered in 
conventional systems. In particular, the virtual effect may be 
restricted to a relatively small area at a specific distance and 
angle from the speakers. Outside this "sweet spot," the 
quality of the virtual sound effect may be greatly diminished. 
As a result, the number of listeners that may experience the 
virtual image at a time is limited. In addition, the virtual 
effect may be restricted to a narrow range of head positions 
within the "sweet spot," so a listener may lose the virtual 
sound effect entirely by turning his head. Such systems 
require the listener to remain in a fixed position relative to 
the speakers and, consequently, are impractical for many 
commercial applications. 

Such limitations make conventional crosstalk cancellation 
difficult to implement in practice. Effective crosstalk
cancellation typically requires precise knowledge of the
location of the speakers, location of each listener and the head
position of each listener. Deviations by the listeners from the 
expected physical location and head position relative to the 
speakers may result in a large and sudden attenuation of the 
virtual effect. 

Some systems have attempted to compensate for the
above limitations by limiting crosstalk cancellation to a
particular band of frequencies. For example, crosstalk
cancellation may be limited to signals having frequencies
between approximately 600 Hz and 10 kHz, an approximation
of the frequency range over which the human auditory 
system can localize a sound source based primarily on the 
IID. This limitation of frequencies at which crosstalk is 
canceled increases the range of head movement that can 
occur within the predetermined sweet spot. 

What is needed is an improved system and method for 
localizing sound in a virtual system. Preferably such a 
system and method would provide a larger sweet spot and be 
less sensitive to head movement of listeners in the sweet 
spot. In addition, such a system and method would
preferably enhance the listeners' ability to perceive and
differentiate the location of virtual sources.

SUMMARY OF THE INVENTION 

One aspect of the present invention provides a system and 
method for providing improved virtual sound images. One 
or more spatial cues of an audio signal may be modulated 
within a desired range to increase the clarity and perceived
localization of the virtual sound image. Such modulation
may be used to cause the virtual source location to move
slightly relative to the listener's head. Preferably, such
movement is not consciously perceived by the listener.

It is an advantage of this and other aspects of the present
invention that virtual sound images may be provided to
multiple listeners located within an enlarged sweet spot,
with less sensitivity to the actual head position of the
listeners. The modulation in the spatial cue(s) of an audio
signal and resulting unperceived "movement" of the virtual
source is believed to assist the auditory system in filtering
out ambiguous ITD, IID, and/or spectra spatial cues.

Another aspect of the present invention provides for a
system and method for spatially shifting the perceived
virtual source location of an audio signal. A spatial shift
signal may be applied to an audio signal to modify one or
more spatial cues (such as ITD, IID, spectra, or any
combination thereof) to approximate the value of the spatial cues
that would be produced if the audio signal were actually
output from the location of the virtual source. The spatial
shift signal may be modulated prior to modifying the audio
signal to enhance perceived localization as described above.
Alternatively, one or more spatial cues of the audio signal
may be modulated directly after the audio signal is modified
by the spatial shift signal.

Another aspect of the present invention provides a system
and method for canceling crosstalk among a set of spatially
shifted audio signals. A delayed, inverted signal may be
produced to cancel a crosstalk signal. The delay applied to
one or more of the signals may be modulated within a
desired range to enhance the perceived localization of the
virtual sound image as described above. The ITD of the
signal may be effectively modulated in this manner.

Another aspect of the present invention provides a system
and method for providing a more robust virtual sound image.
A plurality of audio signals may be modified to have one or
more spatial cues (such as ITD, IID, spectra, or any
combination thereof) to approximate those that would be
produced if the audio signals were actually output from the
location of one or more virtual sources. Crosstalk among the
audio signals may be canceled. The resulting audio signals
may then be enhanced to increase the depth of the sound
perceived by the listener. It is an advantage of this and other
aspects of the present invention that a more robust virtual
sound image representing multiple virtual sources may be
produced without noticeable crosstalk interference.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features and advantages of the present
invention will become more apparent to those skilled in the
art from the following detailed description in conjunction
with the appended drawings in which:

FIG. 1 is a flow chart illustrating a process for generating
multiple virtual sound images that are localized in space
relative to the listener in accordance with an exemplary
embodiment of the present invention.

FIG. 2 is a block diagram of a virtual sound system
according to an exemplary embodiment of the present
invention for generating multiple virtual sound images that
are localized in space relative to the listener.

FIG. 3 is a block diagram showing in additional detail
portions of block 300 of FIG. 2, this block being designated
as the "HRTF Binaural Synthesis System" in FIG. 2.

FIG. 4A is a block diagram showing in additional detail
portions of one embodiment of block 400 of FIG. 2, this
block being designated as "Crosstalk Filter With Modulating
Delay."

FIG. 4B is a block diagram showing in additional detail
portions of a second embodiment of block 400 of FIG. 2, this
block being designated as "Crosstalk Filter With Modulating
Delay."

FIG. 5A is a block diagram showing in additional detail
portions of block 260 of FIG. 2, this block being designated
as the "Stereophonic Image Enhancement System" in FIG.
2.

FIG. 5B is a chart showing the magnitude response of an
exemplary embodiment of filter 540 of FIG. 5A.

FIG. 5C is a chart showing the phase response of an
exemplary embodiment of filter 540 of FIG. 5A.

FIG. 6A is a block diagram of a multichannel virtual
sound system according to an exemplary embodiment of the
present invention.

FIG. 6B shows the positions of the actual and virtual
sources provided by an exemplary embodiment of the
present invention.

FIG. 7 is a block diagram of a digital signal processor-based
multichannel virtual sound system according to an
exemplary embodiment of the present invention.

FIG. 8 is a block diagram of a microprocessor-based
multichannel virtual sound system according to an exemplary
embodiment of the present invention.

FIG. 9 is a simplified block diagram illustrating a virtual
sound system according to an alternate embodiment of the
present invention for generating multiple virtual sound
images that are localized in space relative to the listener.

FIG. 10 is a block diagram showing in additional detail
portions of block 700 of FIG. 9, this block being designated
as "HRTF Binaural Synthesis System with Modulating
Binaural Attributes."

DESCRIPTION

FIG. 1 is a simplified flow chart that is illustrative of an
embodiment of the present invention. In step 100, at least
one audio input signal is received by the virtual sound
system. This audio input signal may be any typical analog or
digital audio input signal. In step 101, the virtual sound
system retrieves a spatial shift signal that is associated with
the desired location (relative to the speakers and listeners of
the virtual sound system) of the virtual sound source. The
spatial shift signal may be a set of coefficients or a
continuous signal or other values that may be applied to an audio
signal to modify one or more spatial cues of the audio signal.
For instance, the spatial shift signal may represent a time
delay to modify ITD, an amplitude shift to modify IID, or a
magnitude by which to shift the spectra to modify the
spectral attributes of the audio signal. In the exemplary
embodiment, the spatial shift signal comprises the direction
specific impulse response ("DSIR") associated with the
desired location of the virtual sound source. The DSIR
comprises the coefficient values (for the left and right ears of
listeners) used by an exemplary embodiment of the present
invention to modify at least one spatial cue of the audio input
signal in order to produce the desired binaural attribute of
the virtual sound source. While the DSIR preferably
comprises coefficients from complex HRTFs that take into
account the ITD, IID and spectral shift of an audio signal,
any variety of spatial shift signals may be used to modify the
binaural attributes of the audio signal.

In step 102, the virtual sound system uses the DSIR to
modify the binaural attribute of the audio input signal. As
shown below, the modification of the binaural attribute of
the audio input signal may be performed by an HRTF
Binaural Synthesis System. One of the results of step 102 is
a pair of "binaural" output signals, one for each ear, for each
audio input signal that is associated with a specific virtual
source location. The term ipsilateral is used to designate the
signal associated with the ear closer to the sound source and
the term contralateral is used to designate the signal that is
associated with the ear that is further from the virtual source
location. This "binaural pair" of signals possesses the spatial
cues for the left and right ears of the listener. Together, the
binaural pair of signals will produce the binaural attribute of
the virtual sound source. The applicable DSIR coefficients
may be applied to one or both of the ipsilateral and
contralateral signals to spatially shift the virtual sound image
that will be produced. For instance, the DSIR (or other
spatial shift signal) may cause one signal to be delayed,
and/or its intensity to be increased or decreased, and/or its
spectra to be modified relative to the other signal to change
the perceived location of the virtual source. The spatial shift
signal may include delay values (which may represent, for
instance, the number of clock cycles to delay one signal) or
intensity or spectral shift values (which may be multiplied or
added to the signal to change its intensity or spectra).

In step 103, the localization and integrity of the virtual
sound source perceived by a listener is improved by
modulating the value of at least one of the spatial cues within at
least one of the binaural pair of output signals created in step
102. The term modulating or modulation refers to varying a
value (e.g., a spatial cue) within a desired range at a specified
rate. The spatial shift signal itself may be modulated prior to
being applied to the audio signal(s) or the spatial cues of the
audio signal(s) may be modulated directly (e.g., by applying
a varying delay to the signal).

In the exemplary embodiment of FIG. 1, the modulation
of the spatial cue has the effect of continuously "moving" the
position of the virtual sound source relative to the head of a
listener (or, in other words, "varying" the head position of
the listener relative to the position of the virtual sound
source). Studies have shown that (i) the position of moving
sound sources is better localized by listeners than the
position of static sound sources and (ii) a listener who is
allowed to vary his or her head position during the
localization process can more accurately localize the position of
a sound source than a listener whose head position remains
fixed during localization. This is because the changes in
ITD, IID and spectra that occur with either (i) sound source
movement or (ii) head movement assist the auditory system
in filtering out ambiguous ITD, IID and/or spectra spatial
cues.

However, in the exemplary embodiment shown in FIG. 1,
modulation of a spatial cue would be undesirable if it altered
the perceived location of the virtual sound source or the
tonal quality of the virtual sound. Neither effect occurs in the
exemplary embodiment. The perceived location of the
virtual sound source remains "fixed" because (1) the values of
the spatial cue are modulated about the desired spatial cue
value so that the average position is at the desired value and
(2) the magnitude (i.e., range) of changes in the spatial cue
is set to a level below the "just noticeable difference"
("jnd") level for the modulated spatial cue. The jnd of a
spatial cue is the magnitude of change below which the
human auditory system does not consciously perceive a
difference in the nature of sound being heard. Thus, a
listener's ability to localize a virtual source may be
improved by changing ITD, IID or spectra spatial cues
without causing associated changes in perceived pitch or
tone.

Moreover, because the virtual source is always, in effect,
moving relative to the head position of the listener, the
exemplary embodiment of the present invention is less
sensitive to the head movement of listeners. The spatial cue
changes that would be associated with normal head
movement are subsumed within the modulation of the spatial cues
by the system of the exemplary embodiment.

Finally, the "sweet spot" of the exemplary embodiment of
FIG. 1 is enlarged over typical conventional virtual sound
systems, which are dependent on a listener being at a
specified position relative to the speakers (i.e., at a position
with a predetermined set of spatial cues). The "moving"
nature of the virtual sound source increases the area over
which the virtual sound effect can be perceived and allows
a listener to gradually enter and exit the effect. With
conventional systems, a listener often experiences an abrupt
drop off of the virtual effect when the listener moves from
the specific sweet spot and head position.

FIG. 2 is a simplified block diagram of a virtual sound
system according to an exemplary embodiment of the
present invention. The virtual sound system includes HRTF
Binaural Synthesis System 220, Crosstalk Filter With
Modulating Delay 240, a Stereophonic Image Enhancement
System 260 and speakers 20 and 30. HRTF Binaural Synthesis
System 220 receives a plurality of audio input signals 201
and then proceeds to modify the binaural attribute of each
audio input signal such that each audio input signal is
transformed into a binaural pair of output signals that
possess the binaural attribute of the desired virtual sound
source. For example, where the number of audio input
signals equals two (2), the HRTF Binaural Synthesis System
220 provides the Crosstalk Filter With Modulating Delay
240 with two (2) binaural pairs of signals 211 and 212. Each
binaural pair of signals comprises two signals: the
ipsilateral and contralateral signals. The Crosstalk Filter
With Modulating Delay 240 performs a crosstalk
cancellation operation on the binaural pairs of signals 211 and 212.
During this crosstalk cancellation the Crosstalk Filter With
Modulating Delay 240 modulates the ITD of one or more of
the signals such that at least one spatial cue is varied in a
range and at a rate just below the jnd value for the spatial
cue. Crosstalk Filter With Modulating Delay 240 then
provides the Stereophonic Image Enhancement System 260
with an input signal associated with each speaker (20 or 30).
Stereophonic Image Enhancement System 260 processes
signals 401 and 402 to increase the "robustness" or depth of
the virtual image. The output of Stereophonic Image
Enhancement System 260 is sent to speakers 20 and 30.

FIG. 3 is a simplified block diagram illustrating the HRTF
Binaural Synthesis System 220 in further detail. Referring to
FIG. 3, the HRTF Binaural Synthesis System includes a
convolution engine 310 for modifying the binaural attributes
of audio input signal 201 and memory 330 for the storage of
the spatial shift signals (e.g., the direction specific binaural
impulse responses) for the left and right ears. The
convolution engine 310 multiplies the spectra of each of the input
signals 201 with the spectra of the appropriate direction
specific binaural impulse response stored in memory 330 to
create the proper binaural pair of output signals associated
with a particular virtual source. For example, if the number
of audio input signals is equal to two (2), the HRTF Binaural
Synthesis System will produce two (2) binaural pairs of
signals, 211 and 212. Each binaural pair of output signals
possesses the proper binaural attributes of the virtual sound
source associated with a particular input signal. The
convolution engine 310 provides functionality similar to one or
more finite impulse response ("FIR") filters or infinite
impulse response ("IIR") filters. A description of the use of
convolution, digital filters and virtual sound may be found in
"3-D Sound for Virtual Reality and Multimedia" by Durand 
R. Begault (1994), which is hereby incorporated herein by 
reference in its entirety. 
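By way of illustration only, the operation of a convolution engine of the kind described above may be sketched in the time domain, where direct convolution with an impulse response is equivalent to multiplying spectra in the frequency domain. The sketch below is an assumption-laden example rather than the disclosed implementation: the function names are hypothetical, and the impulse responses passed in stand for direction specific impulse responses that would in practice be retrieved from memory.

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def binaural_pair(signal, dsir_ipsilateral, dsir_contralateral):
    """Produce an (ipsilateral, contralateral) pair from one input signal.

    The two impulse responses are placeholders for stored DSIR
    coefficients; their values here are supplied by the caller.
    """
    return (convolve(signal, dsir_ipsilateral),
            convolve(signal, dsir_contralateral))
```

In this toy form, a contralateral response such as `[0.0, 0.5]` delays the signal by one sample and halves its amplitude, which loosely mimics an ITD and an IID; measured responses would additionally shape the spectrum.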

There are many well-known types of HRTF binaural 
synthesis in the field of virtual acoustics and binaural audio. 
Exemplary embodiments may use, but are not limited to, any 
combination of (i) FIR and/or IIR filters (digital or analog)
and (ii) spatial shift signals (e.g., coefficients) generated
using any of the following methods: 

raw impulse response acquisition;
balanced model reduction;
Hankel norm modeling;
least squares modeling;
modified or unmodified Prony methods;
minimum phase reconstruction;
iterative pre-filtering; or
critical band smoothing.
For a further explanation of the above methods see J. Smith
III, Ph.D. dissertation report (# Stan-M-14) entitled
"Techniques for Digital Filter Design and System Identification
with Application to the Violin" and C. Lueck, Ph.D.
dissertation report (Iowa State University 1995) entitled
"Modeling of Head Related Transfer Functions for Reduced
Computation and Storage," each of which is hereby
incorporated herein by reference in its entirety.

FIG. 4A is a simplified block diagram illustrating the
operation of the Crosstalk Filter With Modulating Delay 240
that performs the crosstalk operation on the binaural pair
signals 211 and 212. However, in this embodiment, the
crosstalk operation is only performed on the ipsilateral
signal of each binaural pair. The contralateral signals of
binaural pairs 211 and 212 are ignored by the crosstalk filter
(i.e., grounded) because the contralateral signal is often
negligible for common speaker-based configurations. In
blocks 420 and 421, a delay is imposed on the crosstalk
compensation signals 311 and 312 to compensate for the
time it takes an undesired crosstalk signal to reach the
opposite ear of the listener where such signal 211 or 212 is
to be canceled. The delays in blocks 420 and 421 are
modulated by modulators 450 and 451 such that the ITD
delays imposed on the crosstalk compensation signals 311
and 312 are modulated between approximately 0.09 msec
and 2.25 msec at a modulation rate of between about 0.5 and
1.5 Hz in the time or frequency domain. The modulation rate
of between about 0.5 and 1.5 Hz approximates the listener
slightly turning his head back and forth at a rate of between
about once every 2 seconds and once every 2/3 second. After
passing through delay blocks 420 and 421, crosstalk
compensation signals 311 and 312 pass through lowpass filters
430 and 431, which cut off a portion of the signal above a set
frequency. Typically, the cutoff frequency for the lowpass
filter is set at approximately 8 kHz. It has been found that the
best crosstalk cancellation effect occurs if the gain for
lowpass filters 430 and 431 is set at about 1/4 the power of the
signal to be canceled. The crosstalk compensation signals
311 and 312 and signals 211 and 212 are then summed
together as shown at junctions 441 and 442 and sent to the
speakers as signals 401 and 402 either directly or after any
subsequent audio enhancement or processing.
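By way of illustration only, the signal path just described (delay, lowpass filtering, gain, and summation with the opposite channel) may be sketched as follows. This is a minimal sketch under stated assumptions and not the disclosed circuit: the one-pole filter stands in for lowpass filters 430 and 431, the compensation signal is assumed to be inverted before summation, the default gain and sample rate are illustrative, and all function names are hypothetical.

```python
import math


def one_pole_lowpass(signal, cutoff_hz, sample_rate):
    """First-order lowpass filter (an assumed stand-in for 430/431)."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in signal:
        y = (1.0 - a) * x + a * y
        out.append(y)
    return out


def crosstalk_cancel(left, right, delays, gain=0.5,
                     cutoff_hz=8000.0, sample_rate=48000):
    """Add a delayed, filtered, inverted copy of each channel to the other.

    delays[n] is the delay, in whole samples, used at output sample n;
    passing a slowly varying sequence modulates the delay over time.
    """
    def delayed(sig):
        return [sig[n - d] if 0 <= n - d < len(sig) else 0.0
                for n, d in enumerate(delays)]

    comp_for_left = one_pole_lowpass(delayed(right), cutoff_hz, sample_rate)
    comp_for_right = one_pole_lowpass(delayed(left), cutoff_hz, sample_rate)
    out_left = [l - gain * c for l, c in zip(left, comp_for_left)]
    out_right = [r - gain * c for r, c in zip(right, comp_for_right)]
    return out_left, out_right
```

The `delays` argument is left to the caller so that any modulation waveform can be supplied; a constant list reduces the sketch to conventional fixed-delay crosstalk cancellation.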

FIG. 4B is a simplified block diagram illustrating the 
operation of another exemplary embodiment of the 
Crosstalk Filter With Modulating Delay 240 that performs 
the crosstalk operation on both the ipsilateral and
contralateral signals of binaural pairs 211 and 212. In this
embodiment, processed contralateral signals 211B, 212A
and ipsilateral signals 211A, 212B are crosstalk canceled
separately before finally being summed together at junctions
484 and 485 and output as signals 401 and 402.

Signal 211A is the ipsilateral signal intended to be output 
from speaker signal 401 (which may be output, for instance, 
from the left speaker). Signal 211B is the corresponding 
contralateral signal intended to be output from speaker 
signal 402 (which may be output, for instance, from the right 
speaker). The contralateral signal is delayed by block 426 (to 
account for propagation delay of the corresponding crosstalk 
produced by the contralateral signal from the right speaker) 
and passed through low pass filter 435. It is then inverted at 
stage 482 and combined with ipsilateral signal 211A. The 
inverted signal is thereby provided to the left speaker to 
cancel any corresponding crosstalk produced by the
contralateral signal from the right speaker.

Additional signals are also sent to the left speaker in the
system of FIG. 4B. These signals include (i) the contralateral
signal 212A from the other (e.g., right) binaural pair and (ii)
the delayed inverse of the ipsilateral signal 212B from the
right binaural pair (to cancel crosstalk). Ipsilateral signal
212B is delayed by block 424 (to account for propagation
delay of the corresponding crosstalk produced by the
ipsilateral signal from the right speaker) and passed through low
pass filter 433. It is then inverted at stage 481 and combined
with contralateral signal 212A before being sent to the left
speaker.

The signals to be sent to the left speaker are summed 
together at stage 484 to produce speaker signal 401. As 
described above, these signals include: (i) the ipsilateral 
signal 211A from the left binaural pair and the contralateral
signal 212A from the right binaural pair; and (ii) delayed, 
inverted signals to cancel crosstalk from the contralateral 
signal 211B from the left binaural pair and the ipsilateral 
signal 212B from the right binaural pair. 

Similar processing is used to produce speaker signal 402 
for the right speaker. The signals to be sent to the right 
speaker are summed together at stage 485 to produce 
speaker signal 402. These signals include: (i) the ipsilateral 
signal 212B from the right binaural pair and the contralateral 
signal 211B from the left binaural pair; and (ii) delayed,
inverted signals to cancel crosstalk from the contralateral
signal 212A from the right binaural pair and the ipsilateral
signal 211A from the left binaural pair.
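By way of illustration only, the summations performed at stages 484 and 485 may be sketched as follows. The sketch assumes a fixed whole-sample delay and omits the low pass filters and the delay modulation for brevity; the function names and the `gain` parameter are hypothetical and are not taken from the figures.

```python
def delay(signal, d):
    """x(s - d): shift a sample list right by d samples, keeping its length."""
    return ([0.0] * d + list(signal))[:len(signal)]


def speaker_signals(s211a, s211b, s212a, s212b, d, gain=1.0):
    """Return (signal_401, signal_402) for the left and right speakers.

    Each output is the sum of the two signals intended for that speaker
    minus delayed copies of the two signals sent to the other speaker.
    """
    c211b, c212b = delay(s211b, d), delay(s212b, d)
    c211a, c212a = delay(s211a, d), delay(s212a, d)
    left, right = [], []
    for n in range(len(s211a)):
        left.append(s211a[n] + s212a[n] - gain * (c211b[n] + c212b[n]))
        right.append(s212b[n] + s211b[n] - gain * (c212a[n] + c211a[n]))
    return left, right
```

All four input lists are assumed to have the same length.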

In addition to the foregoing, in the embodiment of FIG. 
4B, delay stages 428 and 427 are applied to contralateral 
signals 211B and 212A respectively. The delays imposed by
these stages are modulated by modulators 452 and 453 
respectively. These delay stages and modulators vary the 
ITD attribute of the audio signal in a manner similar to delay 
stages 420 and 421 and modulators 450 and 451 described 
above with reference to FIG. 4A. As described above, the 
ITD may be modulated between approximately 0.09 msec 
and 2.25 msec at a modulation rate of between about 0.5 and 
1.5 Hz in the time or frequency domain. Preferably, the ITD
is varied in a manner that has the effect of slightly moving 
the virtual source location relative to the listener's head to 
enhance the ability of the listener to localize the virtual 
source. As described above, however, such "movement" 
preferably is not consciously perceived by the listener. 

Delay blocks 423, 424, 425, 426, 427 and 428 represent 
time delays. For example, in a digital system, a delay block 
may be represented mathematically as: x(s-d), where x is the 
signal at a given sample, s is the current sample and d is the 
number of samples of delay. Modulators 452 and 453 
operate at frequencies of between about 0.5 Hz and 1.5 Hz. 
Modulation may be accomplished in either the time or 
frequency domains, and by any number of modulation
signals, including but not limited to sine, triangle, square,
sawtooth, or random waveforms. The modulation function need
not be periodic. The desired effect could be achieved by generating
random values around the desired spatial cue value. It has
been found that a periodic triangle waveform provides a 
preferred localization effect for listeners. 
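By way of illustration only, a delay block of the form x(s-d) whose delay d is varied by a periodic triangle waveform may be sketched as follows. The sketch rounds the delay to whole samples, which is a simplification (a practical system might interpolate fractional delays); the function names and parameters are hypothetical.

```python
def triangle(phase):
    """Triangle wave in [-1, 1]; `phase` is measured in cycles."""
    p = phase % 1.0
    return 4.0 * p - 1.0 if p < 0.5 else 3.0 - 4.0 * p


def modulated_delays(n_samples, sample_rate, center_ms, depth_ms, rate_hz):
    """Per-sample delay, in whole samples, swinging center_ms +/- depth_ms."""
    delays = []
    for s in range(n_samples):
        ms = center_ms + depth_ms * triangle(rate_hz * s / sample_rate)
        delays.append(max(0, round(ms * sample_rate / 1000.0)))
    return delays


def apply_delay(x, delays):
    """y[s] = x[s - d[s]], with zeros before the start of the signal."""
    return [x[s - d] if s - d >= 0 else 0.0 for s, d in enumerate(delays)]
```

For example, `modulated_delays(n, 48000, 1.17, 1.08, 1.0)` sweeps the delay between approximately 0.09 msec and 2.25 msec once per second, with the delay averaging to the center value over each cycle.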

FIG. 5A illustrates the stereophonic image enhancement
system shown as block 260 of FIG. 2 in additional detail.
This stereophonic image enhancement system is similar in
effect to the automatic stereophonic image enhancement
system described and claimed in U.S. Pat. No. 5,412,731,
which is incorporated herein by reference in its entirety. At
junction 510, signal 401 is summed with the inverse of
signal 402. The result of this summation is then passed
through filter 540. Filter 540 is a low pass filter having the
characteristics shown in FIGS. 5B (magnitude response) and
5C (phase response). At junction 520, signal 401 is summed
with the output of filter 540 and sent to speaker 20. At
junction 530, signal 402 is summed with the inverse of the
output of filter 540 and sent to speaker 30. It has been found
that connection of the stereophonic image enhancement
system 260 to the output of the Crosstalk Filter With
Modulating Delay 240 improves the quality of the virtual
sound by increasing the depth of the sound perceived by the
listener.
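By way of illustration only, the three junctions described above may be sketched as follows. The one-pole low pass filter and its cutoff frequency are assumptions made for the example; the actual response of filter 540 is the one shown in FIGS. 5B and 5C, which this sketch does not attempt to reproduce. The function name `enhance` is hypothetical.

```python
import math


def enhance(sig401, sig402, cutoff_hz=1000.0, sample_rate=48000):
    """Return the signals for speakers 20 and 30.

    The difference of the two inputs is low pass filtered, then added
    to one channel and subtracted from the other.
    """
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out20, out30, y = [], [], 0.0
    for l, r in zip(sig401, sig402):
        y = (1.0 - a) * (l - r) + a * y   # junction 510, then the filter
        out20.append(l + y)               # junction 520
        out30.append(r - y)               # junction 530
    return out20, out30
```

Because the same filtered difference is added to one output and subtracted from the other, the sum of the two outputs equals the sum of the two inputs, while identical inputs pass through unchanged.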

FIG. 6A is a block diagram of a multichannel virtual 
sound system according to an exemplary embodiment of the 
present invention. Input audio signal 600 is decoded by 
multichannel decoder 610 into a plurality of channel signals 
615. Multichannel decoder 610 may be any standard multichannel
decoder including without limitation multichannel
decoders such as Dolby AC-3, MPEG-2 and MPEG-3. 
These channel signals are then processed through an HRTF 
Binaural Synthesis System 620 which, except for the number
of channel signals, may be identical to the HRTF
Binaural Synthesis System 220 that is shown in FIGS. 2 and 
3. The HRTF Binaural Synthesis System 620 provides each 
channel signal with the proper binaural attributes for its 
intended virtual spatial position. The plurality of output 
signals 615, which constitute a binaural pair of output
signals for each channel signal from HRTF Binaural Syn- 
thesis System 620, are then processed through the Crosstalk 
Filter with Modulating Delay 640. For each binaural pair, 
Crosstalk Filter with Modulating Delay 640 may be identical 
to Crosstalk Filter With Modulating Delay 240.

FIG. 6B shows the positions of the actual and virtual 
sources which may be provided by an exemplary embodi- 
ment of the present invention. In such an embodiment, a 
surround sound effect may be produced from only two actual 
speakers, a left speaker 650 and a right speaker 660. In
contrast to an actual surround sound system, which also uses 
center, left side and right side speakers, this embodiment 
uses a virtual center source 670, virtual left side source 680 
and a virtual right side source 690. The virtual sources are 
simulated by providing spatially shifted audio signals from
the left speaker 650 and right speaker 660. 

Such an embodiment may be implemented as shown in 
FIG. 6A for example. An audio signal 600 with surround 
sound encoded information is processed by Multichannel 
Decoder 610. The Multichannel Decoder 610 may be a
Dolby AC-3 decoder which produces a separate audio signal 
608 for each surround sound speaker — a left, center, right, 
left side and right side audio signal. A low frequency signal 
may also be produced and, optionally, may be simulated in 
the same manner as the center speaker as described below.

In the exemplary embodiment, the various signals to be 
provided to the left speaker 650 and right speaker 660 are
summed together. The left and right surround sound signals
are passed directly to the left and right speakers respectively. 
The virtual center source 670 is simulated by reducing the 
center surround sound signal by approximately 3 decibels 
(i.e., dividing the signal by approximately the square root of 
2). The reduced center surround sound signal is then passed 
to both the left speaker 650 and right speaker 660. Any 
optional low frequency surround sound signal may be vir- 
tualized in a similar manner. 
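
The center-channel reduction may be checked with a short calculation, given here for illustration only. Dividing by the square root of 2 corresponds to a gain of about -3.01 decibels, and the squared gains applied at the two speakers sum to one. The function and variable names are assumptions introduced for this sketch.

```python
import math

# Gain applied to the center signal before it is sent to both speakers.
CENTER_GAIN = 1.0 / math.sqrt(2.0)

def gain_db(gain):
    """Express a linear amplitude gain in decibels."""
    return 20.0 * math.log10(gain)

def mix_front(left, center, right, center_gain=CENTER_GAIN):
    """Sum the reduced center signal into the left and right speaker feeds."""
    left_out = [l + center_gain * c for l, c in zip(left, center)]
    right_out = [r + center_gain * c for r, c in zip(right, center)]
    return left_out, right_out
```

For example, gain_db(CENTER_GAIN) evaluates to approximately -3.01.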

The virtual left side source 680 and virtual right side 
source 690 are produced using an HRTF Binaural Synthesis 
System 220 and Crosstalk Filter with Modulating Delay 240 
as described in conjunction with FIG. 2 above. With the 
configuration shown in FIG. 6B, the contralateral signals 
which would be produced by a left side source and right side 
source would be insubstantial. Accordingly, only ipsilateral
signals need to be processed as described above in conjunc- 
tion with FIG. 4A. The resulting binaural signals (with 
crosstalk compensation signals) for the virtual left side 
source 680 and virtual right side source 690 are then 
provided to the left speaker 650 and right speaker 660 as 
applicable. The audio signals for the virtual left side source 680 and
virtual right side source 690 preferably have at least one
modulated spatial cue to enhance the perceived localization
of listener 675 as described above. While not consciously 
perceived, the slight variance in the virtual left side source 
680 and the virtual right side source 690 improves localiza- 
tion relative to completely static virtual sources. 

Once all of the signals for the left speaker 650 and right 
speaker 660 are summed together, they may be optionally 
passed through a Stereophonic Image Enhancement System 
260 as described above with respect to FIGS. 2, 5A, 5B and 
5C. The resulting signals provide a robust virtual sound
effect with only two actual speakers. 

FIG. 7 is a simplified block diagram of a digital signal 
processor-based multichannel virtual sound system ("DSP 
System") that may be used to implement a variety of 
exemplary embodiments of the present invention. The DSP 
system includes a digital signal processor 700, microcon- 
troller 710, memory 720, multichannel decoder 730 and 
speakers 20 and 30. Digital signal processor 700 may be any 
standard digital signal processor that is capable of perform- 
ing the necessary calculations for real time processing of the 
incoming audio stream. Exemplary digital signal processors 
include without limitation Motorola 56000 series, Zoran 
38000 series and Texas Instruments TMS 320 series. The 
digital signal processor 700 in the exemplary embodiment 
may perform, but is not limited to, the functions of a: (i) 
convolution engine and (ii) crosstalk filter with modulating 
delay. Additionally, in other embodiments, the digital signal 
processor may perform the functions of the multichannel 
decoder 730. Microcontroller 710 may be any standard 
microcontroller that may be used to respond to user requests 
and control the operation of the DSP system. Memory 720 
may be any form of computer memory including without 
limitation ROM, EPROM, EEPROM and Flash EEPROM 
memory. Memory 720 should be sufficient for the storage of 
the spatial shift signals (e.g., direction specific binaural 
impulse responses) for the left and right ears. Speakers 20 
and 30 may be any conventional speakers. 

FIG. 8 is a simplified block diagram of a microprocessor 
(or CPU) based multichannel virtual sound system ("CPU 
System") that may be used to implement a variety of 
exemplary embodiments of the present invention. The CPU 
system includes a microprocessor 800, memory 810, mul- 
tichannel decoder 820 and speakers 20 and 30. Micropro- 
cessor 800 may be any standard microprocessor capable of 
performing the necessary calculations for real time processing
of the incoming audio stream. Exemplary microprocessors
include without limitation the Intel Pentium MMX,
Intel Pentium II, Power PC and the DEC Alpha microprocessors.
The microprocessor 800 in the exemplary embodiment
may perform, but is not limited to, the functions of a:
(i) convolution engine and (ii) crosstalk filter with modulating
delay. Additionally, in some embodiments, the microprocessor
800 may perform all the functions of the multichannel
decoder 820. Memory 810 may be any form of
computer memory including without limitation ROM,
PROM, EEPROM, Flash EEPROM memory, DRAM or
SRAM. Memory 810 should be sufficient for the storage of
the spatial shift signals (e.g., direction specific binaural
impulse responses) for the left and right ears. Speakers 20
and 30 may be any conventional speakers.

FIG. 9 is a simplified block diagram of a virtual sound 
system 900 according to an alternate embodiment of the 
present invention which generates localized virtual images 
by modulating a specific spatial cue in the HRTF Binaural
Synthesis System 910. Referring to FIG. 9, audio input 
signals 905 are provided to HRTF Binaural Synthesis Sys- 
tem 910. The HRTF Binaural Synthesis System 910 contains 
a spatial shift signal that is associated with the desired 
location (relative to the speakers and listeners of the virtual
sound system) of the virtual sound source. In this 
embodiment, the spatial shift signal is the direction specific 
impulse response ("DSIR") for the desired location of the 
virtual sound source. The DSIR comprises the coefficient 
values (for the left and right ears of listeners) used by an
exemplary embodiment of the present invention to modify at 
least one spatial cue of the audio input signals in order to 
produce the desired binaural attribute of the virtual sound 
source. The coefficient values may be, for instance, a time 
delay to modify the ITD binaural attributes of the audio
input signals, an amplitude shift to modify the IID binaural 
attributes of the audio input signals, a magnitude by which 
to shift the spectra to modify the spectral attributes of the 
audio input signals, or a combination of the foregoing. The 
spatial shift signal may be used to modify the respective
spatial cues of the audio signals to produce localized values 
for the spatial cues. The localized values for the spatial cues 
approximate values that would be produced if the audio 
signal were actually output from the desired location of the 
virtual source (i.e., at a certain offset from the actual speaker
location). 

In the embodiment of FIG. 9, however, a spatial shift 
signal for at least one of the spatial cues is modulated before 
being applied to the input audio signals. For instance, a 
spatial shift signal for IID or spectra shift (or the spatial cues
in the audio signal itself) may be modulated between 
approximately 0.25 decibels and 1.5 decibels at a modula- 
tion rate of between about 0.5 and 1.5 Hz in the time or 
frequency domain. As described above, the spatial shift 
signal for ITD (or the spatial cue in the audio signal itself)
may also be modulated between approximately 0.09 msec 
and 2.25 msec at a modulation rate of between about 0.5 and 
1.5 Hz in the time or frequency domain. Any combination of 
the foregoing spatial cues may be modulated by modulating 
the spatial shift signal before applying it to the audio
signal(s) or by modulating the spatial cues in the audio 
signal directly. Preferably, one or more of the spatial cues is 
varied in a manner that has the effect of slightly moving the 
virtual source location relative to the listener's head to 
enhance the ability of the listener to localize the virtual
source. As described above, however, such "movement" 
preferably is not consciously perceived by the listener. 
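
Modulation of the IID around its localized value may be sketched as follows, for illustration only. The sketch assumes that the decibel figure given above is a peak deviation about the localized value, that a sine waveform is used, and that the IID is applied as an attenuation of the right channel; these choices, and all function and parameter names, are assumptions of this sketch and not requirements of the embodiment.

```python
import math

def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude gain."""
    return 10.0 ** (db / 20.0)

def modulated_iid_db(t_seconds, localized_iid_db, depth_db=0.25, rate_hz=1.0):
    """IID in decibels at time t, varied around the localized value."""
    return localized_iid_db + depth_db * math.sin(
        2.0 * math.pi * rate_hz * t_seconds)

def apply_iid(left, right, sample_rate, localized_iid_db,
              depth_db=0.25, rate_hz=1.0):
    """Attenuate the right channel by the modulated IID (positive = left louder)."""
    out_right = []
    for n, x in enumerate(right):
        iid = modulated_iid_db(n / sample_rate, localized_iid_db,
                               depth_db, rate_hz)
        out_right.append(x * db_to_gain(-iid))
    return list(left), out_right
```

An ITD could be varied in the same manner by substituting a delay in milliseconds for the decibel value.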

FIG. 10 is a simplified block diagram illustrating the
operation of HRTF Binaural Synthesis System With Modu- 
lating Binaural Attributes 910 in additional detail. As shown 
in FIG. 10, the HRTF Binaural Synthesis System With 
Modulating Binaural Attributes 910 includes a convolution 
engine 940, memory 950 for storing the direction specific
binaural impulse responses for the left and right ears
and a modulator 960. The modulator 960 modulates the 
direction specific binaural impulse responses for one or 
more of the spatial cues as described above. After such 
modulation, the modulated direction specific binaural 
impulse responses are applied to the input audio signals 905 
by convolution engine 940. The resulting signals 915 are 
modulated pairs of "binaural" output signals, one for each 
ear, for each audio input signal that is associated with a 
specific virtual source location. Except for the slight vari- 
ance due to the modulation, the binaural attributes of the 
output signals 915 are modified to produce audio signals 
from the physical speakers which are representative of those 
that would be produced if the audio signal were actually 
output from the desired location of the virtual source (i.e., at 
a certain offset from the physical speaker location). 
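
The cooperation of the modulator and the convolution engine may be sketched as follows, for illustration only. The sketch assumes that the modulation is applied as a single gain on the right-ear impulse response and is held constant over one block of input; a direct-form convolution is used for clarity rather than speed. The function names and the right_gain parameter are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out

def binaural_pair(signal, dsir_left, dsir_right, right_gain=1.0):
    """Return the left-ear and right-ear outputs for one input block.

    right_gain stands in for the modulator output: it scales the stored
    right-ear impulse response before the convolution is performed.
    """
    modulated_right = [h * right_gain for h in dsir_right]
    return convolve(signal, dsir_left), convolve(signal, modulated_right)
```

In use, right_gain would be recomputed for each block from a slowly varying modulation signal, so that the stored impulse responses themselves remain unchanged in memory.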

As shown in FIG. 9, the modified output signals 915 are 
then provided to Crosstalk Cancellation Filter 920 to cancel 
the effects of crosstalk. The filter 920 may be similar to 
Crosstalk Filter With Modulating Delay 475 described 
above, except that the modulators 452 and 453 are removed, 
because the desired modulation has already been introduced 
by HRTF Binaural Synthesis System 910. After crosstalk 
cancellation, the resulting signals 401 and 402 may be sent 
to speakers 20 and 30. As described above, an optional 
stereophonic image enhancement system (such as 260 in 
FIG. 2) may be interposed between Crosstalk Cancellation 
Filter 920 and speakers 20 and 30. 

While the present invention has been described and 
illustrated with reference to particular embodiments, it will 
be readily apparent to those skilled in the art that the scope 
of the present invention is not limited to the disclosed 
embodiments but, on the contrary, is intended to cover
numerous other modifications and equivalent arrangements 
which are included within the spirit and scope of the 
following claims. 

What is claimed is: 

1. A method for producing an output audio signal per- 
ceived by a listener to originate from a virtual source, said 
method comprising the steps of: 

receiving an audio signal to be output on a speaker system 
at a position offset from the location of the virtual 
source; 

providing a spatial shift signal for modifying a spatial cue 
of the audio signal, wherein the spatial cue is selected 
from the group consisting of interaural time difference, 
interaural intensity difference and spectra; 

using the spatial shift signal to modify the spatial cue of 
the audio signal to produce a localized value for the 
spatial cue, wherein the localized value for the spatial 
cue approximates a value for the spatial cue that would 
be produced if the audio signal were actually output 
from the location of the virtual source; 

modulating the value of the spatial cue of the audio signal 
within a desired range around the localized value to 
enhance the ability of the listener to perceive the 
location of the virtual source; and 

outputting the modified and modulated audio signal from 
the speaker system. 

2. The method of claim 1, wherein the step of modulating 
the value of the spatial cue further comprises the step of 
varying the spatial shift signal before using the spatial shift
signal to modify the spatial cue of the audio signal. 

3. The method of claim 1, wherein the step of modulating 
the value of the spatial cue further comprises the step of 
varying the audio signal after using the spatial shift signal to
modify the spatial cue of the audio signal. 

4. The method of claim 1, wherein the step of using the 
spatial shift signal to modify the spatial cue of the audio 
signal further comprises the step of producing at least two 
spatially shifted audio signals, the method further comprising
the step of adding crosstalk compensation signals to each
of the spatially shifted audio signals.

5. The method of claim 4, wherein each of the spatially 
shifted audio signals is an ipsilateral signal. 

6. The method of claim 1, wherein the step of using the
spatial shift signal to modify the spatial cue of the audio 
signal further comprises the step of producing at least two 
binaural pairs of audio signals, the method further compris- 
ing the step of generating crosstalk compensation signals for 
each of the binaural pairs of audio signals.

7. The method of claim 1, wherein the spatial cue com- 
prises interaural time difference. 

8. The method of claim 7, wherein modulating the value 
of the spatial cue of the audio signal within a desired range 
comprises modulating the interaural time difference between
0.09 milliseconds and 2.25 milliseconds around the local- 
ized value. 

9. The method of claim 8, wherein the value of the 
interaural time difference is modulated at a rate between 0.5 
and 1.5 Hz in the time domain.

10. The method of claim 8, wherein the value of the 
interaural time difference is modulated at a rate between 0.5 
and 1.5 Hz in the frequency domain. 

11. The method of claim 1, wherein the spatial cue 
comprises interaural intensity difference.

12. The method of claim 11, wherein modulating the value 
of the spatial cue of the audio signal within a desired range 
comprises modulating the interaural intensity difference 
between 0.25 decibels and 1.5 decibels around the localized 
value.

13. The method of claim 12, wherein the value of the 
interaural intensity difference is modulated at a rate between 
0.5 and 1.5 Hz in the time domain. 

14. The method of claim 12, wherein the value of the 
interaural intensity difference is modulated at a rate between
0.5 and 1.5 Hz in the frequency domain. 

15. The method of claim 1, wherein the spatial cue 
comprises spectra. 

16. A system for producing an output audio signal perceived
by a listener to originate from a virtual source, the
system comprising: 

a processor operatively coupled to a memory; 

the memory containing a spatial shift signal; 

the processor receiving an input audio signal and modifying
the input audio signal in accordance with the
spatial shift signal to produce at least two spatially 
shifted signals that, in combination, possess the 
approximate localized value of spatial cues that would 
be produced if signals were actually output from the
location of the virtual source; 

a crosstalk compensation circuit; 

the crosstalk compensation circuit generating at least one 
crosstalk compensation signal to compensate for 
crosstalk between the at least two spatially shifted
signals; 

a modulator for varying at least one spatial cue around the
localized value for the at least two spatially shifted 
signals; and 

a speaker system for outputting the at least two spatially 
shifted signals with the varying spatial cue and the at 
least one crosstalk compensation signal. 

17. The system of claim 16, wherein the modulator varies 
the spatial shift signal in order to vary the at least one spatial 
cue for the at least two spatially shifted signals. 

18. The system of claim 16, wherein the modulator varies 
the crosstalk compensation signal in order to vary the at least 
one spatial cue for the at least two spatially shifted signals. 

19. A method for producing an output audio signal per- 
ceived by a listener to originate from a virtual source, said 
method comprising the steps of: 

receiving an audio signal to be output on a speaker system 
at a position offset from the location of the virtual 
source; 

providing a spatial shift signal for modifying a spatial cue 
of the audio signal, wherein the spatial cue is selected 
from the group consisting of interaural time difference, 
interaural intensity difference and spectra; 

using the spatial shift signal to modify the spatial cue of 
the audio signal to produce a localized value for the 
spatial cue, wherein the localized value for the spatial 
cue approximates a value for the spatial cue that would 
be produced if the audio signal were actually output 
from the location of the virtual source; 

modulating the value of the spatial cue of the audio signal 
within a desired range around the localized value to 
enhance the ability of the listener to perceive the 
location of the virtual source, wherein the desired range 
within which the value of the spatial cue is modulated 
comprises a range below the just noticeable difference 
("jnd") level of the spatial cue; and 

outputting the modified and modulated audio signal from 
the speaker system. 

20. The method of claim 19, wherein 

the spatial cue comprises interaural time difference, 
modulating the value of the spatial cue of the audio signal 
within a desired range comprises modulating the inter- 
aural time difference between 0.09 milliseconds and 
2.25 milliseconds around the localized value, and 
the value of the interaural time difference is modulated at 
a rate between 0.5 and 1.5 Hz in the time domain or the 
frequency domain. 

21. The method of claim 19, wherein 

the spatial cue comprises interaural intensity difference, 
modulating the value of the spatial cue of the audio signal 
within a desired range comprises modulating the inter- 
aural intensity difference between 0.25 decibels and 1.5 
decibels around the localized value, and 
the value of the interaural intensity difference is modu- 
lated at a rate between 0.5 and 1.5 Hz in the time 
domain or the frequency domain. 

22. The method of claim 19, wherein the spatial cue 
comprises interaural time difference. 

23. The method of claim 19, wherein the spatial cue 
comprises interaural intensity difference. 

24. The method of claim 19, wherein the spatial cue 
comprises spectra. 